Optical Character Recognition

In this write-up we will create 4 different methods of image classification for handwritten digits. First we will do some Data Visualization to help understand the problem. Then we will train and test Neural Networks, Gradient Boosting, and Random Forests to see how each performs with some tuning. Afterwards we will look at a majority vote between the three algorithms to see if we get any improvement.

Note: You will see libraries loaded several times and variables reused. This is because I ran these in 5 separate IPython Notebooks and 1 R script (for 3D Data Visualization).

Data Visualization

In [45]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

This IPython notebook is just to get a sense of the data and then summarize the results we have seen attempting different machine learning methods. Below is a plot of some of the handwritten numbers, made by converting the rows into matrices and then vertically and horizontally stacking the images.

In [44]:
from numpy import genfromtxt
#Load in the optical data
optical = genfromtxt('train.csv', delimiter=',')
#Trim off the header row
optical = optical[1:, :]

#Stack the selected digit images into a 5x5 mosaic of 28x28 tiles
columns = [[0,1,2,15,20],[3,4,5,16,21],[6,7,8,17,22],[9,10,11,18,23],[12,13,14,19,24]]
f = np.hstack([np.vstack([printnum(optical[i][1:]) for i in idx]) for idx in columns])

plt.imshow(f, cmap = plt.cm.Greys_r)
plt.axis('off')
plt.title('Handwritten Digits')
plt.savefig('HandwrittenDigits.png', bbox_inches = 'tight')
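
The `printnum` helper used in this cell is never defined in these notebooks. Presumably it just reshapes a flattened 784-value pixel row into a 28x28 matrix so the rows can be stacked into the mosaic; a minimal sketch under that assumption (the name and exact behavior are guesses, not the original code):

```python
import numpy as np

def printnum(pixels):
    #Reshape a flattened 784-value pixel row into a 28x28 image matrix
    #(assumed implementation -- the original helper is not shown)
    return np.asarray(pixels, dtype=float).reshape(28, 28)
```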

We can see that there are some difficult numbers, such as the very first image, and even the 5s can look like 6s. This suggests that we could see some difficulty in capturing the outliers within the dataset. As always there are tradeoffs in how hard we try to capture these outliers.

We will attempt Neural Networks which are known to be good for image classification because of their ability to find features within an image. While these can have large training times, we will at least attempt to tune a Neural Network to where the performance is good but not necessarily optimal.

Gradient Boosting may be helpful, although this could add far too much complexity to catch the edge cases.

Random Forests may also be a good option, because if we use many (~500) trees with random splitting criteria we may be able to catch outliers as well.

In [46]:
#Read in data into a Data Frame
df = pd.read_csv("train.csv")
In [47]:
import matplotlib
matplotlib.style.use('ggplot')
In [52]:
df.ix[:,0].value_counts().plot(kind='bar')
plt.title("Histogram Plot of Labels")
plt.xlabel("Number Written")
plt.ylabel("Count")
Out[52]:
<matplotlib.text.Text at 0x1167bddd0>
We can see from the histogram that we have a good sample of all of the possible numbers. Now we will use PCA to simplify the data so we can see it in 3D. It could be that there is enough information contained within the first three principal components that we could use a nice simple machine learning method such as KNN! Note: the following was completed in R.

Principal Component Analysis and 3D Plot from R

digit_reader-R_Markdown

One thing that we wanted to see was what this would look like in a 3D plot. Here we will use R because of its rgl package, which allows for fast 3D plotting. But first we have to use Principal Component Analysis to reduce the number of dimensions.

library(bpca)
library(rgl)
library(ggfortify)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(rglwidget)
knit_hooks$set(webgl = hook_webgl)

setwd("/Users/bobminnich/Documents/Columbia/Courses/Data_Mining/Examples/DigitReader")
data = as.data.frame(read.csv("train.csv", header = TRUE, sep = ","))
labels2 = data[,1]
labels_frame <- as.data.frame(data[,1])
labels_frame <- setNames(labels_frame, c("l"))
data_r = data[,-1]
pca = prcomp(data_r)
plot.new()
screeplot(pca, main = "PCA Plot of NIST data", type = "lines")

We can see from the screeplot that three principal components may not capture enough variance to fully represent the data, but we will continue to see if we can understand anything visually from the plot.

dev.off()
## null device 
##           1
colorpal = c("#E41A1C", "#0066ff", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33", "#A65628","#ff37cb","#66ff33", "#00ffff")

#Find colors associated with labels and apply the color palette
for(i in 1:10){
  labels_frame$color[data$label == i-1] = colorpal[i]
}
#Used {r testgl, webgl=TRUE, } for R Chunk
#Plotting
plot3d(pca$x[,1:3],col = labels_frame$color, size = 1)
[Interactive 3D plot: first three principal components colored by digit (0-9). Scroll to zoom, click and drag to rotate.]

We end up with a very cool looking plot that allows us to rotate and zoom in on specific areas. One thing that is quite noticeable is that the 1s are somewhat out on their own. This makes sense because, of all of the numbers, the 1 is probably the most unique.
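
If the first three components had looked separable enough, the KNN-on-PCA idea from earlier could be sketched in scikit-learn like this. Note this runs on sklearn's small built-in 8x8 digits set as a stand-in for the Kaggle CSV, purely for illustration:

```python
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#Small 8x8 digit set stands in for the 28x28 Kaggle data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

#Project onto the first three principal components, then fit KNN on them
pca = PCA(n_components=3).fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_train), y_train)
score = knn.score(pca.transform(X_test), y_test)
print("3-component KNN accuracy: %.3f" % score)
```

As the screeplot suggested, three components discard most of the variance, so this will not match the full-dimensional models, but it makes the idea concrete.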

OCR-Gradient_Boosting

Gradient Boosting

Here we will use the Gradient Boosting algorithm to perform image classification.

In [22]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import sklearn
import numpy as np
import matplotlib
import pandas as pd
from sklearn import cross_validation
from sklearn import datasets
from sklearn import preprocessing
from sklearn.learning_curve import learning_curve
from sklearn.learning_curve import validation_curve
from sklearn import ensemble
import time
In [8]:
clf_gb = ensemble.GradientBoostingClassifier(random_state = 1, n_estimators=250)
In [9]:
start_time = time.time()
#X_train/y_train reuse the same 80/20 split created in the other notebooks
clf_gb.fit(X_train, y_train)
print("--- %s seconds ---" % (time.time() - start_time))
--- 1685.95019102 seconds ---
In [10]:
clf_gb.score(X_test,y_test)
Out[10]:
0.95880952380952378

Note: I was having quite a bit of trouble running Grid Search on my computer. It seemed to take far too long, so I went with a Gradient Boosting model that had a high number of estimators. This still needs to be optimized, which could be done once I take my class on distributed computing.

In [17]:
scaler = preprocessing.StandardScaler()  
X_final = scaler.fit_transform(df.ix[:,1:])  
y_final = df.ix[:,0]

start_time = time.time()
clf_gb.fit(X_final, y_final)
print("--- %s seconds ---" % (time.time() - start_time))
--- 2141.05330396 seconds ---
In [19]:
#Load, Scale, and output data to be uploaded to Kaggle
df_test = pd.read_csv("test.csv")
df_test = scaler.transform(df_test)
prediction = clf_gb.predict(df_test)
pred_df = pd.DataFrame(prediction)
pd.DataFrame.to_csv(pred_df,"output_GradientBoost.csv")
In the end we had a score of 0.95857 from Kaggle, which was very close to what we saw during testing. This number could be improved by adjusting the number of estimators and the learning rate. We did not explore this because of the length of time it took to train the Gradient Boosting algorithm. What we have observed, though, is that it worked very well for image classification without any tuning.
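
Tuning the number of estimators and the learning rate as suggested would look roughly like the sketch below. It uses sklearn's small built-in digits set and a deliberately tiny grid to keep the runtime manageable; these are not the parameters or data behind the Kaggle score:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

#Small built-in digit set keeps the runtime manageable for illustration
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X[:600], y[:600], test_size=0.2, random_state=1)

#Deliberately tiny grid -- a full search over hundreds of estimators is
#what was impractical on a single machine
param_grid = {'n_estimators': [20, 50], 'learning_rate': [0.1, 0.3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```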
OCR-NeuralNetworks

Neural Networks

Now we will attempt to use the same data set on Neural Networks.

In [19]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import sklearn
import numpy as np
import matplotlib
import pandas as pd
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn import model_selection
In [6]:
#Read in data into a Data Frame
df = pd.read_csv("train.csv")

#Split data into training and testing sets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(df.ix[:,1:],df.ix[:,0] , test_size=0.2, random_state=1)
In [293]:
from sklearn.preprocessing import StandardScaler
#Create Scaler to standardize by scaling and shifting the data
scaler = StandardScaler()  

#Train scaler only on training data then transform
scaler.fit(X_train)  
X_train = scaler.transform(X_train) 

#Transform X_test using trained scaler
X_test = scaler.transform(X_test)  
X_total = scaler.transform(df.ix[:,1:])  
The following information was obtained during a long training session using Grid Search to determine the best parameters for a Neural Network on OCR. Due to the long run time this was done in a separate file, so the following code could be run without a long wait. Note that I used the Python library pickle to store and reload the Grid Search results because of the amount of time they took to compute. Below are the results.
In [17]:
import pickle
testv = pickle.load( open( "save.p", "rb" ) )
testv
Out[17]:
[mean: 0.90539, std: 0.00408, params: {'alpha': 0.10000000000000001, 'hidden_layer_sizes': (10, 5)},
 mean: 0.92839, std: 0.00319, params: {'alpha': 0.10000000000000001, 'hidden_layer_sizes': (30, 5)},
 mean: 0.94250, std: 0.00269, params: {'alpha': 0.10000000000000001, 'hidden_layer_sizes': (50, 5)},
 mean: 0.94036, std: 0.00333, params: {'alpha': 0.10000000000000001, 'hidden_layer_sizes': (60, 5)},
 mean: 0.89875, std: 0.00425, params: {'alpha': 0.01, 'hidden_layer_sizes': (10, 5)},
 mean: 0.92765, std: 0.00330, params: {'alpha': 0.01, 'hidden_layer_sizes': (30, 5)},
 mean: 0.93625, std: 0.00689, params: {'alpha': 0.01, 'hidden_layer_sizes': (50, 5)},
 mean: 0.94140, std: 0.00350, params: {'alpha': 0.01, 'hidden_layer_sizes': (60, 5)},
 mean: 0.90979, std: 0.00684, params: {'alpha': 0.001, 'hidden_layer_sizes': (10, 5)},
 mean: 0.92872, std: 0.00627, params: {'alpha': 0.001, 'hidden_layer_sizes': (30, 5)},
 mean: 0.94214, std: 0.00301, params: {'alpha': 0.001, 'hidden_layer_sizes': (50, 5)},
 mean: 0.94310, std: 0.00554, params: {'alpha': 0.001, 'hidden_layer_sizes': (60, 5)},
 mean: 0.90366, std: 0.00582, params: {'alpha': 0.0001, 'hidden_layer_sizes': (10, 5)},
 mean: 0.92720, std: 0.00148, params: {'alpha': 0.0001, 'hidden_layer_sizes': (30, 5)},
 mean: 0.93854, std: 0.00433, params: {'alpha': 0.0001, 'hidden_layer_sizes': (50, 5)},
 mean: 0.93878, std: 0.00589, params: {'alpha': 0.0001, 'hidden_layer_sizes': (60, 5)},
 mean: 0.90548, std: 0.00804, params: {'alpha': 1.0000000000000001e-05, 'hidden_layer_sizes': (10, 5)},
 mean: 0.92923, std: 0.00219, params: {'alpha': 1.0000000000000001e-05, 'hidden_layer_sizes': (30, 5)},
 mean: 0.93637, std: 0.00659, params: {'alpha': 1.0000000000000001e-05, 'hidden_layer_sizes': (50, 5)},
 mean: 0.94155, std: 0.00586, params: {'alpha': 1.0000000000000001e-05, 'hidden_layer_sizes': (60, 5)},
 mean: 0.90893, std: 0.00368, params: {'alpha': 9.9999999999999995e-07, 'hidden_layer_sizes': (10, 5)},
 mean: 0.93199, std: 0.00341, params: {'alpha': 9.9999999999999995e-07, 'hidden_layer_sizes': (30, 5)},
 mean: 0.93708, std: 0.00541, params: {'alpha': 9.9999999999999995e-07, 'hidden_layer_sizes': (50, 5)},
 mean: 0.94179, std: 0.00524, params: {'alpha': 9.9999999999999995e-07, 'hidden_layer_sizes': (60, 5)}]
In [279]:
#Load the best parameters found by Grid Search
alpha = 0.001
hidden_layer_sizes = (60,5)
clf = MLPClassifier(algorithm='l-bfgs', alpha=alpha, hidden_layer_sizes=hidden_layer_sizes)
In [282]:
def plotter(train_scores,test_scores,train_sizes,xlabel,ylabel,title,filename):
    #plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.1,color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",label="Testing score")
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.title(title)
    plt.savefig(filename)
    plt.show()
In [280]:
from sklearn.learning_curve import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf, X_train, y_train, train_sizes = np.linspace(.1,.99, num = 10), cv=5)
In [288]:
import matplotlib.pyplot as plt
plotter(train_scores,test_scores,train_sizes,"Training Examples", "Score", "Training Curves using 5 K-fold", "Learning_Curves");

The learning curves show that we continue to see an increase in the performance of the model as we add more and more samples, which could be due to a complex model whose variance decreases with more samples. This could also potentially be corrected by increasing the regularization penalty if more samples are not available.

In [ ]:
from sklearn.learning_curve import validation_curve
train_scores_val, test_scores_val = validation_curve(clf, X_train,y_train, param_name="alpha", param_range=10.0 ** -np.arange(1, 7), cv = 5,scoring="accuracy")
In [289]:
plotter(train_scores_val,test_scores_val,10.0 ** -np.arange(1, 7),"Alpha L2 Penalty", "Score", "Validation Curves using 5 K-fold","Training_Curves");

The validation curves tell us that if we keep increasing the L2 penalty (regularization) we may continue to see an increase in the performance of the Neural Network. There could be a large benefit to creating a more complex model using additional hidden layers. However, the drawback is that training time grows as the complexity of the Neural Network (additional layers) increases.

Once I take the class on distributed systems I will look at creating plots with increasing complexity.

In [8]:
alpha = 0.1
hidden_layer_sizes = (60,5)
clf = MLPClassifier(algorithm='l-bfgs', alpha=alpha, hidden_layer_sizes=hidden_layer_sizes)
In [9]:
import time
scaler = preprocessing.StandardScaler()  
X_final = scaler.fit_transform(df.ix[:,1:])  
y_final = df.ix[:,0]

start_time = time.time()
clf.fit(X_final, y_final)
print("--- %s seconds ---" % (time.time() - start_time))
--- 27.0941269398 seconds ---
In [11]:
#Load, Scale, and output data to be uploaded to Kaggle
df_test = pd.read_csv("test.csv")
df_test = scaler.transform(df_test)
prediction = clf.predict(df_test)
pred_df = pd.DataFrame(prediction)
pd.DataFrame.to_csv(pred_df,"output_NeuralNetowrk.csv")
The final score from Kaggle was 0.95829, which is pretty good, even though we think there could be improvement from increasing the complexity by adding more nodes.
OCR-RandomForests

Random Forests

Here we will use Random Forests to attempt image classification.

In [234]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler  
from sklearn import model_selection
In [3]:
df = pd.read_csv("train.csv")
X_train, X_test, y_train, y_test = cross_validation.train_test_split(df.ix[:,1:],df.ix[:,0] , test_size=0.2, random_state=1)
scaler = StandardScaler()  
scaler.fit(X_train)  
X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
In [17]:
from sklearn import tree
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators=1)
clf.fit(X_train,y_train)
Out[17]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [16]:
def plotter(train_scores,test_scores,train_sizes,xlabel,ylabel,title,filename):
    #plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.1,color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",label="Testing score")
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.title(title)
    plt.savefig(filename)
    plt.show()
In [20]:
clf.score(X_test,y_test)
Out[20]:
0.79559523809523813
In [33]:
from sklearn.learning_curve import validation_curve
prange = [1,100,250,500,750,1000]
#train_scores_val, test_scores_val = validation_curve(clf, X_train,y_train, param_name="n_estimators", param_range = prange, cv = 5,scoring="accuracy")
/usr/local/lib/python2.7/site-packages/sklearn/learning_curve.py:23: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
In [46]:
import pickle
#pickle.dump(train_scores_val, open( "train_scores_val.p", "wb" ))
#pickle.dump(test_scores_val, open( "test_scores_val.p", "wb" ))

train_scores_val = pickle.load( open( "train_scores_val.p", "rb" ) )
test_scores_val = pickle.load( open( "test_scores_val.p", "rb" ) )
In [49]:
plotter(train_scores_val,test_scores_val,prange,"Number of Trees","Accuracy","Random Forests - Validation 5Kfold ","RandomForests_Val")

As we can see from the plot, there is not much benefit from having over 100 trees, and because the training time increases greatly past 100 trees, we will stick with 100 for the rest of the tuning process.

In [52]:
from sklearn.learning_curve import learning_curve
clf = ensemble.RandomForestClassifier(n_estimators=100)

train_sizes, train_scores, test_scores = learning_curve(clf, X_train, y_train, train_sizes = np.linspace(.1,.99, num = 10), cv=5)
In [71]:
pickle.dump(test_scores, open( "test_scores_learn.p", "wb" ))
pickle.dump(train_scores, open( "train_scores_learn.p", "wb" ))
In [53]:
plotter(train_scores,test_scores,train_sizes,"Training Examples", "Score", "Training Curves using 5 K-fold", "RandomForests_Learning_Curves");

As we saw with the Neural Network, we could probably improve the performance if we were able to gain more samples. We will now keep tuning the Random Forests by adjusting the minimum samples per leaf and the maximum number of features.

In [68]:
import time
start_time = time.time()
clf = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1, oob_score = True)
#Unlimited (n_jobs=-1) --- 4.14583897591 seconds ---
#1 Processor --- 20.5791888237 seconds ---

clf.fit(X_train,y_train)
print("--- %s seconds ---" % (time.time() - start_time))
--- 8.01108717918 seconds ---
In [77]:
param_grid = [{'min_samples_leaf': [1,20,50,100,250,500,1000,2000,3000], 'max_features': ["auto","sqrt","log2", None]}]
clf_grid = sklearn.model_selection.GridSearchCV(clf,param_grid, cv = 5)
In [78]:
clf_grid.fit(X_train,y_train)
Out[78]:
GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'max_features': ['auto', 'sqrt', 'log2', None], 'min_samples_leaf': [20, 50, 100, 250, 500, 1000, 2000, 3000]}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
In [176]:
clf_grid.grid_scores_ = pickle.load( open( "clf_grid_scores_.p", "rb" ) )

Here is an example of what the output of the Grid Scores looks like

In [232]:
clf_grid.grid_scores_[1:5]
Out[232]:
[mean: 0.93801, std: 0.00222, params: {'max_features': 'auto', 'min_samples_leaf': 20},
 mean: 0.92220, std: 0.00364, params: {'max_features': 'auto', 'min_samples_leaf': 50},
 mean: 0.90533, std: 0.00216, params: {'max_features': 'auto', 'min_samples_leaf': 100},
 mean: 0.87500, std: 0.00418, params: {'max_features': 'auto', 'min_samples_leaf': 250}]
In [178]:
mean_list = list()
cv_list = list()
maxfeat_list = list()
min_leaf_list = list()
for i in range(0,len(clf_grid.grid_scores_)):
    mean_list.append(clf_grid.grid_scores_[i][1])
    cv_list.append(clf_grid.grid_scores_[i][2])
    maxfeat_list.append(clf_grid.grid_scores_[i][0]['max_features'])
    min_leaf_list.append(clf_grid.grid_scores_[i][0]['min_samples_leaf'])
In [180]:
for i in range(0,len(mean_list),9):
    plt.plot(min_leaf_list[i:i+5],mean_list[i:i+5])
plt.legend(np.unique(maxfeat_list),loc=10, bbox_to_anchor=(1.125, 0.80))
Out[180]:
<matplotlib.legend.Legend at 0x154fdbb50>

The plot shows that the forest performs much better with a low minimum number of samples per leaf (1) and with "auto" for the maximum number of features. "auto" was very similar to None (which I believe suggests that we need a very complex model with close to n maximum features).

One thing that is surprising is that sklearn says we should get the same results with "auto" and "sqrt", which we are not seeing in this plot.

In [181]:
#from sklearn.ensemble import ExtraTreesClassifier
#clf_extra = ExtraTreesClassifier(n_estimators=100, max_depth=None,min_samples_split=1, random_state=0)
#clf_extra.fit(X_train,y_train)
#clf.score(X_test,y_test)
In [203]:
clf_final = ensemble.RandomForestClassifier(n_estimators=100, max_features= "auto",min_samples_leaf = 1, n_jobs=-1 )
In [ ]:
 
In [186]:
scaler = StandardScaler()  
X_final = scaler.fit_transform(df.ix[:,1:])  
y_final = df.ix[:,0]
In [208]:
clf_final.fit(X_final,y_final)
Out[208]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [206]:
df_test = pd.read_csv("test.csv")
df_test = scaler.transform(df_test)
In [209]:
prediction = clf_final.predict(df_test)
In [219]:
pred_df = pd.DataFrame(prediction)
In [220]:
pd.DataFrame.to_csv(pred_df,"output.csv")

The results gave a Kaggle score of 0.96600.

OCR-Majority_Vote

Majority Vote Approach

Now we will take a look at combining all 3 types of algorithms tested to see if we can improve the performance by simply taking a majority vote between the 3 algorithms.

In [68]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
In [2]:
df_NN = pd.read_csv("output_NeuralNetowrk.csv")
df_RF = pd.read_csv("output_RandomForests.csv")
df_GB = pd.read_csv("output_GradientBoost.csv")
In [7]:
#Rename Data Frames
df_NN.columns=["ImageID","Label_NN"]
df_RF.columns=["ImageID","Label_RF"]
df_GB.columns=["ImageID","Label_GB"]
In [25]:
#Combine the Data Frames
df_com = pd.concat([df_NN,df_RF["Label_RF"],df_GB["Label_GB"]],axis = 1)
In [67]:
#Add a Final column so we can place the majority vote there
df_com['Final'] = pd.Series(0, index=df_com.index) 

We will now show a sample of rows where the algorithms did not agree on their output. Then we will work to take the majority vote and place it in the Final column.

In [66]:
Mis_Matched_Reults = df_com[(df_com.Label_RF != df_com.Label_NN) | (df_com.Label_NN != df_com.Label_GB)| (df_com.Label_RF != df_com.Label_GB)]
Mis_Matched_Reults.ix[15:101,:]
Out[66]:
Label_NN Label_RF Label_GB Final
15 5 3 3 3
30 7 7 4 7
36 5 5 9 5
47 2 8 8 8
59 0 5 0 0
76 5 8 3 0
81 4 4 9 4
101 9 3 9 9
In [52]:
#Use Boolean indexing to take a vote across the algorithms. If any label appears at least 2 times, then we go with that result
df_com.ix[(df_com.Label_RF == df_com.Label_NN),"Final"] = df_com.ix[(df_com.Label_RF == df_com.Label_NN),"Label_RF"]

df_com.ix[(df_com.Label_RF == df_com.Label_GB),"Final"] = df_com.ix[(df_com.Label_RF == df_com.Label_GB),"Label_RF"]

df_com.ix[(df_com.Label_GB == df_com.Label_NN),"Final"] = df_com.ix[(df_com.Label_GB == df_com.Label_NN),"Label_GB"]
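
The three pairwise assignments above amount to a two-out-of-three vote. The same rule can be written more directly with `collections.Counter`; this sketch (not the notebook's code) mirrors its behavior of leaving the placeholder 0 when all three classifiers disagree:

```python
from collections import Counter

def majority_vote(labels, fallback=0):
    #Return the label at least two classifiers agree on, or `fallback`
    #when all three disagree (matching the Final=0 placeholder above)
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else fallback

#Rows from the mismatch table above, as (NN, RF, GB) label triples
print(majority_vote((5, 3, 3)))  # -> 3
print(majority_vote((5, 8, 3)))  # -> 0, since no two classifiers agree
```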
In [55]:
#Change for output for Kaggle Competition
pd.DataFrame.to_csv(df_com.ix[:,(0,4)],"Combined.csv")

Here are the final Kaggle scores for all of the algorithms:
Neural Network = 0.95829
Random Forests = 0.96600
Gradient Boosting = 0.95857
Majority Vote = 0.96729

What we have seen here is that without much tuning (if the training time was too long we moved on) we were able to get great results very quickly using any of the 3 algorithms. If we combined all three and took a majority vote, we saw an extra increase in performance.

No doubt more time could be spent tuning each of these algorithms to improve performance, but overall it was a good learning exercise in practicing data science.